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ABSTRACT 

In  this  paper  we  implement  various  linear  and  nonlinear  subspace-based  anomaly  detectors  for  hyperspectral 
imagery.  First,  a  dual  window  technique  is  used  to  separate  the  local  area  around  each  pixel  into  two  regions  -  an 
inner- window  region  (IWR)  and  an  outer- window  region  (OWR).  Pixel  spectra  from  each  region  are  projected 
onto  a  subspace  which  is  defined  by  projection  bases  that  can  be  generated  in  several  ways.  Here  we  use  three 
common  pattern  classification  techniques  (Principal  Component  Analysis  (PC A),  Fisher  Linear  Discriminant 
(FLD)  Analysis,  and  the  Eigenspace  Separation  Transform  (EST))  to  generate  projection  vectors.  In  addition 
to  these  three  algorithms,  the  well-known  Reed-Xiaoli  (RX)  anomaly  detector  is  also  implemented.  Each  of  the 
four  linear  methods  is  then  implicitly  defined  in  a  high-  (possibly  infinite-)  dimensional  feature  space  by  using  a 
nonlinear  mapping  associated  with  a  kernel  function.  Using  a  common  machine-learning  technique  known  as  the 
kernel  trick  all  dot  products  in  the  feature  space  are  replaced  with  a  Mercer  kernel  function  defined  in  terms  of 
the  original  input  data  space.  To  determine  how  anomalous  a  given  pixel  is,  we  then  project  the  current  test  pixel 
spectra  and  the  spectral  mean  vector  of  the  OWR  onto  the  linear  and  nonlinear  projection  vecotrs  in  order  to 
exploit  the  statistical  differences  between  the  IWR  and  OWR  pixels.  Anomalies  are  detected  if  the  separation  of 
the  projection  of  the  current  test  pixel  spectra  and  the  OWR  mean  spectra  are  greater  than  a  certain  threshold. 
Comparisons  are  made  using  receiver  operating  characteristics  (ROC)  curves. 

Keywords:  Anomaly  detection,  hyperspectral  imagery,  Eigenspace  Separation  Transform  (EST),  kernel-based 
machine  learning,  kernel  PCA,  kernel  Fisher  discriminant,  kernels. 

1.  INTRODUCTION 

Anomaly  detectors  are  pattern  recognition  schemes  that  are  used  to  detect  objects  that  might  be  of  military 
interest.  Almost  all  anomaly  detectors  attempt  to  locate  anything  that  looks  different  spatially  or  spectrally 
from  its  surroundings  using  a  dual  rectangular  window  approach.1  In  spectral  anomaly  detection  algorithms, 
pixels  (materials)  that  have  a  significantly  different  spectral  signature  from  their  neighboring  background  clutter 
pixels  are  identified  as  spectral  anomalies.  Spectral  anomaly  detection  algorithms1  8  could  also  use  spectral 
signatures  to  detect  anomalies  embedded  within  background  clutter  with  a  very  low  signal-to-noise  ratio.  In 
spectral  anomaly  detectors,  no  prior  knowledge  of  the  target  spectral  signature  is  utilized  or  assumed. 

One  way  of  designing  an  anomaly  detector  is  by  projecting  the  input  spectra  onto  a  subspace  whose  bases 
are  defined  by  some  projection  vectors.  In1  researchers  compared  subspace-based  anomaly  detection  algorithms 
using  projection  vectors  which  were  generated  using  three  common  pattern  recognition  techniques  -  Principal 
Component  Analysis  (PCA),9  Fisher  Linear  Discriminant  (FLD)  Analysis,10  and  the  Eigenspace  Separation 
Transform  (EST).11  In  addition,  they  compared  the  performance  of  a  standard  anomaly  detection  algorithm, 
the  so  called  Reed-Xiaoli  (RX)  detector,5  with  the  performances  of  their  subspace-based  anomaly  detectors. 

In  many  situations,  however,  a  linear  classifier  is  not  always  sufficient;  that  is,  most  real-world  data  are  not 
linearly  separable.  Furthermore,  most  data  do  not  fit  the  Gaussian  distribution  assumption  made  by  the  RX 
algorithm.  However,  by  using  a  nonlinear  mapping  to  transform  each  spectrum  into  a  high-dimensional  (possibly 
infinite-dimensional)  feature  space  we  can  potentially  exploit  higher  order  correlation  between  the  spectral  bands, 
something  that  is  not  possible  in  the  linear  anomaly  detectors.  The  resulting  linear  hyperplane  separating  the 
anomalies  from  the  background  in  the  high-dimensional  feature  space  corresponds  to  a  nonlinear  boundary  in 
the  original  input  space.  Unfortunately,  it  is  computationally  infeasible  to  carry  out  any  algorithms  in  this  high 
dimensional  feature  space.  However,  this  problem  can  be  circumvented  by  using  a  common  machine  learning 
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Figure  1.  An  example  of  a  dual  window  with  guard  band.  The  numbers  represent  the  length  in  pixels  making  up  the  side 
of  each  window.  Each  ‘x’  in  the  IWR  represents  one  pixel.  The  pixel  in  red  represents  the  current  test  pixel.  The  figure 
is  not  necessarily  drawn  to  scale. 

technique  known  as  kernelization.  By  kernelizing  the  algorithm,  all  dot  products  between  mapped  vectors  in 
the  feature  space  are  instead  computed  using  a  predetermined  kernel  function  on  the  input  data.  Moreover,  this 
technique  will  significantly  simplify  the  mathematical  computation. 

This  paper  also  examines  the  performance  of  the  kernel  versions  of  each  of  the  four  methods  (PC  A,  FLD,  EST 
and  RX)  described  above  and  applies  each  one  to  the  anomaly  detection  problem.  More  specifically,  Kernel-RX 
(KRX),6  Kernel  Principal  Component  Analysis  (KPCA),12  Kernel  Fisher  Discriminant  (KFD),13  and  Kernel 
Eigenspace  Separation  Transform  (KEST),  which  is  introduced  in  this  paper,  are  all  implemented  and  their 
performances  are  compared  against  each  other  as  well  as  against  their  linear  counterparts. 

This  paper  is  structured  in  the  following  manner.  In  Section  2  an  introduction  to  subspace-based  anomaly 
detection  can  be  found  and  brief  descriptions  of  the  three  linear  methods  which  are  used  to  generate  the  projection 
vectors  are  given.  Section  3  contains  a  brief  introduction  to  kernel-based  learning  techniques  while  Section  4 
extends  each  of  the  three  linear  methods  from  Section  2  into  their  respective  nonlinear  kernelized  methods.  The 
RX  and  Kernel-RX  Algorithms  are  presented  in  Section  5.  Results  and  analysis  of  all  four  methods  and  their 
kernel  version  as  applied  to  simulated  data  as  well  as  multiple  hyperspectral  data  sets  can  be  found  in  Section 
6.  Finally,  concluding  remarks  are  made  in  Section  7. 

2.  LINEAR  SUBSPACE-BASED  ANOMALY  DETECTION 

One  common  method  used  in  many  anomaly  detection  algorithms  is  the  dual-window  approach  Its  use  is  pred¬ 
icated  on  the  fact  that  it  exploits  both  spatial  variability  in  the  image  as  well  as  spectral  variability  among 
different  materials.  At  each  pixel  location  concentric  rectangular  windows  centered  at  the  test  pixel  are  opened 
creating  two  disjoint  regions  -  an  inner  window  region  (IWR)  and  an  outer  window  region  (OWR).  Hence,  the 
local  pixel  neighborhood  is  separated  into  two  smaller  regions.  The  size  of  the  inner  window  is  generally  set 
so  that  the  inner  window  can  fully  enclose  a  target.  In  most  anomaly  detectors  another  concentric  rectangle 
centered  at  the  test  pixel  known  as  the  ‘guard  band’  is  utilized  as  well.  An  example  of  a  dual-window  with  guard 
band  is  seen  in  Figure  (1).  The  guard  band  is  slightly  larger  in  size  than  the  IWR  yet  still  smaller  than  the 
OWR.  The  main  purpose  of  the  guard  band  is  to  reduce  the  probability  that  some  target  spectra  will  inhabit 
the  OWR  and  hence  affect  the  background  model.8  In  subspace-based  anomaly  detection  techniques  projection 
(basis)  vectors  are  generated  using  the  statistical  properties  of  the  IWR  and  OWR  covariance  matrices. 

Using  the  eigen- value  decomposition  of  the  covariance  matrices  of  IWR  and  OWR  spectra  it  is  possible 
to  generate  basis  vectors  for  a  subspace  onto  which  vectors  from  the  IWR  and  OWR  are  projected  for  dis¬ 
crimination.  Denote  a  spectral  vector  within  the  IWR  of  a  dual  window  centered  at  a  test  pixel  by  = 
(or/c (1) ,  £/c(2),  . . .  ,x/c(J))T  where  J  refers  to  the  number  of  spectral  bands  and  k  =  1, . . .  ,  7V*n.  Assuming  that 
there  are  a  total  of  Nin  pixels  in  the  IWR,  the  matrix  X  =  [xi,  X2, . . . ,  x Nin ]  is  of  size  J  x  Nin  and  contains  the 
spectra  of  each  one  of  these  samples  as  one  of  its  Nin  columns.  Similarly,  let  a  spectral  vector  which  is  contained 
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within  the  OWR  of  the  same  dual  window  be  denoted  by  where  l  =  1, . . .  ,Nout.  Given  that  there  are  Nout 
pixels  in  the  OWR,  the  J  x  Nout  matrix  Y  =  [y1,y2?  •  •  • ,  y Nout]  *s  one  whose  columns  are  the  spectral  vectors 
of  the  pixels  in  the  OWR  representing  the  background  clutter  samples.  The  background  clutter  statistics  are 
estimated  using  the  spectra  of  the  pixels  in  the  OWR.  The  covariance  matrices  of  the  IWR  and  OWR  spectra 
are  given  by 

Cx  =  ^T-i(X-Ax)(X-Ax)T  (1) 

Cy  =  1  (Y  -  Ay)  (Y  -  Ay)T  •  (2) 

^ out  ~  -L 

where  fix  and  fiY  are  defined  as  the  statistical  means  of  the  IWR  and  OWR  spectra,  respectively.  The  vector 
jjbY  represents  the  estimate  of  the  mean  of  the  background  clutter. 

The  projection  separation  statistic  for  a  input  test  pixel  denoted  by  r  is  calculated  using 

s'  =  (  r-AY)TWWT(r-AY).  (3) 

where  W  =  [wy  w2  . . .  wm]  is  a  matrix  whose  columns  are  m  projection  vectors.  The  product  WWT  is  known 
as  a  projection  operator  and  represents  a  subspace  characterizing  the  spectra  used  to  generate  the  projection 
vectors  wq.  An  anomaly  is  detected  if  the  projection  separation,  s',  is  greater  than  some  threshold,  77.  It  is  also 
possible  to  project  the  difference  (r  —  fiY)  onto  the  complement  subspace  (I  —  WWT)  given  by 

S'  =  (r-AY)T(I-WWT)(r-AY).  (4) 

Equation  (4)  is  only  used  in  some  algorithms  (e.g.  -  PC  A  and  EST).  In  the  experimental  results  section,  only 
the  best  results  between  Equations  (3)  and  (4)  are  reported  and  mention  will  be  made  regarding  which  equation 
was  used.  In  the  following  subsections  three  different  methods  are  used  to  generate  the  projection  vectors,  W, 
in  order  to  obtain  the  projection  operator  in  the  Equations  (3)  and  (4)  . 

2.1  Principal  Component  Analysis 

Principal  Component  Analysis  (PCA)  is  one  of  the  most  commonly  used  methods  for  feature  extraction  and 
dimensionality  reduction.  The  underlying  goal  is  to  find  a  projection  which  best  represents  some  input  data 
in  the  least-squared  sense.9  In  order  to  generate  the  projection  vectors,  wq,  the  background  clutter  covariance 
matrix,  Cy,  is  written  in  terms  of  its  eigenvectors  V  and  their  corresponding  eigenvalues  A  as 

CY  =  VAVt.  (5) 

the  first  m  eigenvectors  with  the  highest  corresponding  eigenvalues  form  the  projection  vectors.  Thus, 

w PCA  =  v  =  [vi,  v2,  •  •  •  ,  vm]  (6) 

where  m  is  a  configurable  constant.  Altering  the  value  of  m  (i.e.  -  changing  the  number  of  eigenvectors  used) 
will  change  the  performance  of  the  anomaly  detector  as  shown  by  experimental  results. 

Using  Equation  (6)  as  the  projection  vectors  and  substituting  this  result  into  Equation  (3)  or  (4)  we  obtain 
the  corresponding  projections  of  the  input  onto  the  subspace  and  the  complement  subspace.  The  PCA  anomaly 
detector  is  given  by 

PCA(r)  =  (r  -  Ay)T  (wfcmW tpca)  (r  -  Ay)-  (7) 

PCA(r)  =  (r  -  Ay)T  (i  -  WPCiW^)  (r  -  Ay)-  (8) 

The  idea  behind  using  the  PCA  eigenvectors  lies  in  the  fact  that  since  these  eigenvectors  are  optimal  (i.e.  - 
they  minimize  the  mean-square  error)  in  terms  of  their  representation  of  the  spectral  vectors  of  the  OWR,  the 
projection  of  the  difference  between  the  test  pixel,  r,  and  the  outer  window  mean,  /CtY,  should  ideally  be  large  if 
the  dual  window  is  centered  on  an  anomalous  target. 
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The  algorithm  outlined  above  can  also  be  developed  using  samples  collected  from  the  IWR.  In  this  case, 
Equations  (7)  and  (8)  remain  the  same  with  the  exception  that  the  projection  vectors  Wj  are  generated  using 
spectral  information  contained  in  the  IWR  rather  than  the  OWR.  In  this  paper,  Equations  (7)  and  (8)  are 
referred  to  as  the  ‘PC A  Algorithm’  or  simply  ‘PC A’.  In  Section  6,  only  the  best  result  among  the  four  possible 
choices  for  PC  A  is  given. 


2.2  Fisher  Linear  Discriminant  Analysis 

Although  PC  A  has  proven  to  be  very  useful  for  efficient  representation  of  data,  it  does  not  exploit  the  information 
in  the  IWR  and  OWR  at  the  same  time  in  order  to  generate  the  target  or  background  subspaces.  However,  Fisher 
Linear  Discriminant  (FLD)  Analysis,10  which  attempts  to  seek  an  optimal  direction  for  discriminating  between 
IWR  and  OWR  data  samples,  does.  First,  the  between-class  scatter  matrix  is  defined  as  S b  —  (Ax  —  Ay)  (Ax  — 
A y)T  while  the  within-class  scatter  matrix  can  be  written  as  =  Cx  +  Cy  where  Cx  and  Cy  are  the 
covariance  matrices  of  the  samples  in  the  IWR  and  OWR  defined  by  Equations  (1)  and  (2),  respectively.  The 
matrix  Sb  is  a  measure  of  how  well  the  means  of  the  two  classes  are  separated  while  the  matrix  Sw  is  a  measure 
of  the  compactness  of  each  class  cluster. 

In  order  to  calculate  the  optimal  discrimination  direction,  w*,  the  criterion  function 


*  /  \  |wTSsw| 

w  =  max  J(w)  =  - — =— - r 

w  |wJ 

needs  to  be  maximized  over  all  possible  w  and  has  been  shown10  to  be  given  by 

wr  =  w*  =  S  within  -  Vout) 


(9) 


(10) 


Using  Equation  (10)  as  the  projection  vector  and  substituting  this  result  into  Equation  (3)  gives 

FLD(r)  =  (r  -  /ty)T  (wpWp)  (r  -  (iY).  (11) 

The  idea  behind  using  FLD  is  that  it  will  produce  a  large  projection  separation  if  the  spectral  means  of  the  IWR 
and  OWR  are  sufficiently  dissimilar  while  the  spectral  vectors  in  each  region  are  tightly  clustered.  In  this  paper, 
Equation  (11)  is  referred  to  as  the  ‘FLD  Algorithm’  or  simply  ‘FLD’. 


2.3  Eigenspace  Separation  Transform 

The  Eigenspace  Separation  Transform  (EST)  was  developed  by  Torrieri11  as  a  preprocessing  technique  to  ex¬ 
tract  features  for  neural  network  classifiers  and  has  been  successfully  used  by  researchers  for  automatic  clutter 
rejection.14  Like  PC  A,  EST  aims  to  extract  features  from  a  training  set  by  projecting  the  input  patterns  onto 
a  lower-dimensional  orthogonal  subspace.  In  this  paper  it  is  used  to  generate  projection  vectors  in  order  to 
separate  target  pixels  from  background  clutter. 

In  EST  algorithm  we  first  compute  the  J  x  J  Difference  Correlation  (DCOR)  matrix 

M  =  Rx  —  Ry  =  2— XXT  —  —I — YYt  (12) 

^ in  ^ out 


where  M  is  simply  the  difference  of  the  correlation  matrices  of  the  IWR  (Rx)  and  OWR  (Ry)  and  represents 
the  second  order  statistic  differences  between  the  two  regions1  . 

The  eigenvalue  decomposition  of  DCOR  can  be  rewritten  in  block- matrix  form  in  terms  of  its  positive  and 
negative  eigenvalues  and  eigenvectors  as 


M  =  [  V+  V_ 


(13) 


where  the  columns  of  V+  and  V_  are  the  eigenvectors  with  their  corresponding  non-zero  positive  (A+)  and 
negative  (A_)  eigenvalues,  respectively. 
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The  matrix  West  is  then  chosen  to  be  the  set  of  m  positive  or  negative  eigenvectors  associated  with  M.  The 
choice  of  which  set  to  use  hinges  on  which  set  of  eigenvalues  (positive  or  negative)  has  the  largest  absolute  sum. 
Thus,  the  EST  projection  vectors  are  given  as 


West  =  [vi,  v2, . . . ,  vm] 

where  (i  =  1, . . . ,  ra)  are  the  m  most  significant  (either  positive  or  negative)  eigenvectors  or  M. 
Using  Equation  (14)  as  the  projection  vectors  and  substituting  this  result  into  Equation  (3)  gives 


EST(r)  =  (r  -  Ayf  (wBSTW|sr)  (r  -  Ay)- 


(14) 


(15) 


It  is  also  possible  to  project  onto  the  complement  subspace,  (i  —  W_estW^5T).  Thus,  substituting  Equation 
(14)  into  Equation  (4)  yields 


EST(r)  =  (r  -  Ay)T  (i  -  WESTWtESt )  (r  -  Ay)-  (16) 

Since  it  is  possible  to  use  either  the  positive  eigenvectors  or  the  negative  eigenvectors,  there  are  four  possible 
equations  that  can  possibly  be  used.  In  this  paper,  Equations  (15)  and  (16)  are  referred  to  as  the  ‘EST  Algorithm’ 
or  simply  ‘EST’.  In  the  experimental  results  in  Section  6,  only  the  best  results  among  the  four  possible  choices 
of  EST  are  presented. 


3.  KERNEL  LEARNING  THEORY 

Suppose  the  input  data  set  lies  in  the  data  space  (y  E  5ft J)  and  let  T  be  a  feature  space  (also  known  as  a  Hilbert 
space)  associated  with  y  by  some  nonlinear  mapping  function  <f>.  In  particular, 

$  :  X  ? 

x  i — >  <f>(x).  (17) 

where  x  is  an  input  vector  (x  E  x)  which  is  mapped  into  a  much  higher  dimensional  feature  space.  Mapping 
the  data  using  into  T  is  useful  in  many  ways.  The  most  significant  benefit  is  that  it  is  possible  to  define  a 
similarity  measure  using  the  dot  product  in  T  in  terms  of  a  function  of  the  corresponding  data  in  the  input 
space.  Thus,  it  is  possible  to  write 


fc(x;,Xj)  -  ($(xi),$(xj)>.  (18) 

Equation  (18),  which  is  commonly  referred  to  in  machine  learning  literature  as  the  kernel  trick,15  states  that  all 
dot  products  in  T  (a  task  which  is  otherwise  computationally  infeasible)  can  be  implicitly  computed  by  simply 
using  kernel  functions  defined  on  the  input  data.  Moreover,  all  of  this  can  be  accomplished  without  actually 
mapping  the  input  vectors  into  T .  Hence,  conveniently,  the  mapping  does  not  even  need  to  be  identified  nor 
defined.  In  other  words,  Equation  (18)  illustrates  that  all  dot  products  in  T  can  be  replaced  by  an  appropriately 
chosen  Mercer  kernel  function  A:.16  For  a  more  comprehensive  discussion  about  the  properties  of  various  types 
of  kernels  and  for  more  information  on  kernel-based  learning  in  general,  see  one  of  the  many  references  devoted 
to  kernel  methods.15, 16 

4.  KERNEL  SUBSPACE-BASED  ANOMALY  DETECTION 

In  this  Section,  each  of  the  linear  subspace-based  methods  in  Sections  2. 1-2.3  is  extended  into  the  feature  space 
T  and  then  kernelized  by  replacing  all  dot  products  in  the  feature  space  by  kernel  functions  using  the  kernel 
trick  in  Equation  (18).  The  following  subsections  present  a  derivation  of  each  of  the  kernelized  algorithms.  Using 
a  nonlinear  mapping  <I>,  the  original  data  in  the  input  space  defined  by  X  and  Y  are  mapped  into  the  feature 
space  T  and  denoted  by 

X*  =  $(X)  =  [$(x1)$(x2)...$(xJVlJ]  (19) 

Y*  =  $(Y)=[$(yi)$(y2)...$(yAroJ]  (20) 
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This  means  that  X<|>  represent  the  mapped  IWR  spectra  and  Y$  represents  the  mapped  OWR  spectra.  The 
statistical  means  of  the  mapped  data  in  T  are  represented  by  fix^  and  fiY^,  respectively.  For  many  of  the 
following  methods,  it  is  assumed  that  the  mapped  data  is  centered  in  T .  Thus,  denote  each  centered  vector  for 
the  IWR  in  T  as  <Fc(xi)  =  <f>(x^)  —  i  =  1?  •  •  • ,  Wn  and  similarly  for  the  OWR  spectra  (i.e.  -  $c(y  •)  = 

T(y^)  —  fiY^ ,  j  =  1, . . .  ,Nout).  Then,  let  Xc<3>  and  Yc^  be  matrices  whose  columns  are  the  centered  IWR  and 
OWR  in  the  feature  space,  respectively.  Also,  let  and  Cy^  be  the  covariance  matrices  of  the  centered 

spectra  in  the  feature  space.  The  projection  of  the  mapped  test  pixel  spectra  4>(r)  onto  a  linear  subspace  in  the 
feature  space  which  is  equivalent  to  a  nonlinear  subspace  in  the  original  input  domain  is  given  by 

s'  =  ($(r)  -  AYJTW$Wj($(r)  -  Ay*)  (21) 

where  W$  =  [w$  ]  is  a  matrix  whose  columns  are  the  set  of  m  projection  vectors  in  T .  Similarly, 

projection  onto  the  complement  subspace  is  given  by 

s'  =  ($(r)  -  AyJT  (i*  -  W*Wj)  (*(r)  -  Ay*)  (22) 

The  following  subsections  outline  three  methods  that  can  provide  the  projection  vectors  W$  =  [w^  ] 

in  order  to  perform  nonlinear  anomaly  detection  using  a  nonlinear  subspace.  The  methods  employed  are  simply 
nonlinear  extensions  of  each  of  the  three  algorithms  detailed  in  Section  2. 

4.1  Kernel  Principal  Component  Analysis 

In  this  section,  the  PC  A  method  is  mapped  into  the  feature  space  T  and  then  reformulated  solely  in  terms  of 
dot  products.  The  kernel  trick  is  then  utilized  to  help  to  make  the  problem  computationally  feasible.  As  in  the 
linear  PC  A  algorithm,  KPCA  can  be  formulated  using  either  the  IWR  or  OWR  spectra  to  formulate  the  KPCA 
projection  vectors  in  the  feature  space. 

In  order  to  find  the  PC  A  eigenvectors  in  the  feature  space,  simply  solve  the  mapped  version  of  Equation  (5); 
specifically,  the  eigenvalues  and  eigenvectors  of  Cy$  in  T  can  be  found  by  solving 

Cy.  =  V*A*V£  (23) 

where  A$  =  diag  (A$  A|  ...  A^,)  and  V$  =  [v$  v|  . . .  v^]  contain  only  the  p  nonzero  eigenvalues  and  cor¬ 
responding  eigenvectors  of  Cy$.  All  eigenvectors  in  the  feature  space  lie  in  the  span  of  the  vectors  in  Yc<&. 

Therefore,  they  can  be  represented  as 

V*  =  A*-1/2Yc*A  (24) 

where  A  =  [aq  a  2  •••  OLXout ]  is  a  matrix  whose  columns  are  the  nonzero  eigenvectors  of  the  centered  Gram 
(kernel)  matrix  Kc  and  contains  the  associated  nonzero  eigenvalues.  The  centered  kernel  matrix  can  be 
calculated  by  Kc  =  (K  —  livoutK  —  K1  Xout  +  ljvoutKljvout)  where  K  is  the  kernel  matrix  whose  elements  are 
(K )ij  =  fc(yi5  yj)  with  i,  j  =  1, . . . ,  Nout  and  (lwout)  is  an  Nout  x  Nout  with  each  element  equal  to  1  /Nout.  It  is 
known12  that  the  eigenvalue  decomposition  of  the  centered  kernel  matrix  is  given  by 

Kc  =  AA$At  (25) 

Utilizing  only  the  m  most  significant  eigenvectors  in  the  features  space  they  can  be  written  as 

Wpca  =  V$  =  Yc<3>A  (26) 

where  A  =  [aq  aq  *  •  *  am]  are  the  m  most  significant  eigenvectors  of  Kc  normalized  by  the  square  roots  of 
their  respective  eigenvalues.  The  vectors  V$  are  then  used  as  projection  vectors  in  the  feature  space  (W$)  in 
Equation  (21).  Substituting  Equation  (26)  into  Equation  (21)  gives 

KPCA(r)  =  (^(r)-AYJ)T(v^)(^(r)-AYJ. 

=  (4>(r)  —  AYcJTYc<i,  Appca  AkpcayJ$  ($(r)  —  Ay*)-  (27) 
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For  notational  simplicity,  let  Kyr  =  (I)(r)i  Y<:,(>  and  Ky^  =  /ly.^  ;  these  are  commonly  referred  to  as 

empirical  kernel  expansions.  Substituting  these  results  into  Equation  (27)  results  in 

KPCA(r)  =  ^Kyr  -  K^A)  T  AKPCaATkpca  (K^r  -  k£a)  .  (28) 

~  T 

It  should  be  pointed  out  that  as  in  the  linear  case,  we  can  also  project  onto  the  complement  subspace  (I  — V$V$) 
in  the  feature  space.  As  mentioned  above,  the  KPCA  algorithm  can  be  formulated  using  the  IWR  spectra  rather 
than  the  OWR  spectra  to  generate  the  projection  vectors  W$  in  the  feature  space. 


4.2  Kernel  Fisher  Discriminant  Analysis 


It  has  been  shown13, 15  how  to  extend  FLD  analysis  to  its  nonlinear  version  by  using  the  kernel  trick  to  compute 
the  Fisher  discriminant  in  the  feature  space.  Defining  FLD  in  the  feature  space  is  equivalent  to  maximizing  the 
cost  function  given  by 


J(w$)  = 


Tc$ 


SBw$| 


(29) 


where  w$  is  the  projection  vector,  =  Cx$  -f 
and  between-class  scatter  matrices,  respectively,  in  T. 


Cy4  and  S|  =  -  AyJ(Ax,,  ~  AyJT  are  the  within-class 


Finding  an  optimal  w$  by  maximizing  Equation  (29)  is  not  mathematically  tractable  considering  the  simple 
fact  that  the  feature  space  is  of  high  (possibly  infinite)  dimensionality.  Fortunately,  we  can  reformulate  this 
problem  in  terms  of  dot  products  in  the  feature  space  and  then  replace  them  with  kernel  functions.  Based  on 
reproducing  kernel  theory,  any  solution  w$  to  Equation  (29)  can  be  expanded  as 


Ntot 

Wcj>  =  ai<&(z>i)  =  Z$a  (30) 

i=  1 

where  NTOT  =  Nin  +  Nout  and  Z$  =  [zi ,  z2  , . . .  ,  ^nTOt\  =  [$c(*i) ,  •  •  •  ,  ^c(xyj  ,  $c(Yi)  ,  ■  •  ■  ,  ®c{yNout)\  is 
a  matrix  whose  columns  are  the  mapped  vectors  in  T  of  the  corresponding  spectra  in  both  the  IWR  and  OWR 
concatenated  together  and  a  is  the  KFD  vector  in  the  feature  space. 

Combining  the  definition  of  /xx<j>  and  Equation  (30)  yields 

w$Ax*  =  oiTMin  (31) 

where  (M in)j  =  N—  &(XUX0-  Similarly,  using  the  definition  of  and  Equation  (30)  gives 

w5Ay*  =  OiTmout  (32) 

where  (WLout)j  =  N—  *  &(y jiYi)-  Notice  that  the  second  equations  in  both  expansions  above  are  the  direct 

result  of  using  the  kernel  trick  (Equation  (18)). 

By  using  the  definition  of  Sf)  and  Equations  (31)  and  (32),  the  numerator  of  Equation  (29)  can  be  written  as 

wJSgW<£  =  olt  Aol  (33) 

where  A  =  (Min  —  M.out)(Min  —  Mout)T.  Using  a  similar  argument,  the  denominator  of  Equation  (29)  can  be 
rewritten  as 

=  aTBa  (34) 

where  B  =  Kin(I  -  lwin)K^  +  Kont(I  -  lNout)K^ut,  I  is  the  identity  matrix,  Kin  is  an  NTOT  x  Nin  Gram 
matrix,  K out  are  Ntot  x  Nout  Gram  matrix,  and  1  Nin  and  1  Nout  are  matrices  with  each  entry  equal  to  1/Wn 
and  l/Nout,  respectively.  For  example,  each  element  of  K in  and  Kout  are  defined  to  be 


(K^n)r 
(K  out)r 


fc(xn,xm) 

fc(ynAm)- 
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ORGANIZATION 


A  combination  of  Equations  (33)  and  (34)  means  that  Fisher’s  discriminant  in  T  can  now  be  found  by 
maximizing 


J(«) 


olt  Aol 
aTBa 


(35) 


As  in  the  solution  to  the  analogous  problem  in  the  input  space  (Equation  (9)),  Equation  (35)  can  be  solved 
simply  by  finding  the  leading  eigenvector,  olkfd ,  of  B_1A.  Thus,  the  identity  in  Equation  (30)  becomes 
w$  =  w  =  Z &OLKFD- 

Substituting  this  result  into  Equation  (21),  gives 


KFD(r)  =  (4>(r)  -  Ay JT  KwfJ  ($(r)  -  AyJ- 

=  (4>(r)  —  AY*)T^4>a^F£>Q;^F£)Zj($(r)  —  Ay*)- 

To  simplify  this  equation,  let 

$(r)TZ*  =  $(r)T  [$(Xl)  $(x2)  . . .  $(XjVtot)]  =  k  (Z,  r)T  =  KZr 


(36) 


(37) 


and 


~  1 

Z([,  —  — — 
*  Nmi 


y,  $(y)T  [$(zi)  $(z2)  . . .  $(zjvT 


yEOWR 

Using  Equations  (37)  and  (38),  Equation  (36)  becomes 

k  T 


N„ 


E  k(Z,y)T  =  KZM. 


y  EOWR 


KFD(r)  =  (K|r  -  k£m)  cxkfdcxtkfd  (k£.  -  K^) 

Equation  (39)  is  the  equation  used  for  the  Kernel  Fisher  Discriminant  algorithm  in  this  paper. 


(38) 


(39) 


4.3  Kernel  Eigenspace  Separation  Transform 

In  this  section,  EST  is  defined  in  the  feature  space  T  and  then  reformulated  solely  in  terms  of  dot  products. 
Once  again,  the  kernel  trick  is  utilized  to  convert  it  into  its  kernel  version.  The  difference  correlation  matrix 
(DCOR)  R  for  the  input  data  in  the  feature  space  can  be  written  as 


R$ 


Rx*  -Ry*=7E$(X)$(X)t-  J_$(Y)<J>(Y)t 

^in  ^ out 


[  $(X)  -$(Y) 


$(X)T/7Vin 
$(Y  )T/Nout 


(40) 


Here,  the  correlation  matrix  in  the  feature  space  of  the  first  class  is  Rx*  =  ‘I’  ( X )  N  X  )7  /N, n  and  likewise,  the 
correlation  matrix  in  the  feature  space  of  the  second  class  is  Ry*  =  'NY) <1> (Y)T / Nou , .  The  eigen  decomposition 
of  DCOR  in  the  feature  space  can  be  rewritten  in  block-matrix  form  in  terms  of  its  positive  and  negative 
eigenvalues  and  eigenvectors  as 


R$  =  [  V+#  V_# 


A+* 

0 


(41) 


where  the  columns  of  V+(J)  and  V-^  are  the  eigenvectors  in  the  feature  space  with  their  corresponding  non-zero 
positive  (A+#)  and  negative  (A_^)  eigenvalues,  respectively.  In  order  to  diagonalize  the  DCOR  matrix  R$  we 
must  find  all  eigenvectors  (both  positive  and  negative)  V$  and  all  nonzero  eigenvalues  A$  which  satisfy  the 
equation 

A$V$  =  Rc[>Vci>.  (42) 
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Kernelization  of  EST  in  the  Feature  Space 

Due  to  the  (possibly)  extreme  high-dimensionality  of  the  feature  space,  (42)  cannot  be  explicitly  solved.  In  order 
to  circumvent  this  problem,  the  equation  can  be  kernelized  by  writing  it  in  terms  of  kernel  functions.  Doing  so 
allows  us  to  implement  the  equation  in  the  original  input  domain  in  terms  of  kernel  functions. 

Each  eigenvector  v|  in  the  feature  space  can  be  written  as  a  linear  combination  of  the  centered  input  data 


as 


v*  =  ■ 


1  Nin  _  1  1  NOUt  1 


2=1 


3  = 1 


1  $(X)afcAd-  1  <S>(Y)f3kAj 


\i Nin 


VNC 


(43) 


out 


where  the  expansion  coefficients,  or  and  f3  ,  are  defined  as  or  =  (oq ,  cd>, . . . ,  <a^)T  and  P  =  (n?  ,  •  •  • ,  PnouJ 

for  k  =  1, . . .  ,Nt  where  Nt  =  Nin  +  A^ont-  Equation  (43)  can  be  used  to  write  all  eigenvectors  with  non-zero 
eigenvalues  as 


V* 


*  V<E> 


4>(X)AA“#  -4>(Y)BAI 
[  4>(X)  — 4>(Y) 

[  <f>(X)  — <f>(Y)  ]  D 


A 

A_d  o 

B 

o  a:^ 

(44) 


where 


A 

a1 

cx2 

cxN *  1 

A 

X/N- 

V  1  v  m 

y/  Nin 

y/  Nin 

B 

(31 

(32 

(3n± 

y/  N  out 

y/ Nou± 

y/Nout  _ 

A  = 


A+ 

0 


0 

A_ 


and  the  columns  of 


D 


A 

A+*  0 

cx 1 

cx2 

cxN*  1 

A 

y/N~ 

y/N~ 

X/N- 

V  1  v  m 

B 

o  a:1 

(31 

P2 

/ sN t 

y/  Nout 

y /Nout 

y/  N  out  _ 

[A]' 


(45) 


represent  the  eigenvectors  of  a  kernel  matrix  associated  with  the  kernelized  version  of  EST  (shown  below).  By 
substituting  equations  (40)  and  (44)  into  (42)  and  using  the  kernel  trick  from  Equation  (18)  to  simplify  we  obtain 


A[$(X)  — 4>(Y)]D  =  [<h(X)  — 4>(Y)] 


Kxx  _  KXy 

Nin  Nin 

Kyx  Kyy 

NOUt  N out 


D 


(46) 


where  Kxx  =  4>(X)T<f>(X)  is  an  Nin  x  Nin  kernel  (Gram)  matrix,  Kyy  =  4>(Y)T<h(Y)  is  an  Nout  x  Nout  kernel 
matrix,  Kxy  —  4>(X)T<f>(Y)  is  an  Nin  x  Nout  kernel  matrix,  and  Kyx  =  4>(Y)T4>(X)  is  an  Nout  x  A^n  kernel 
matrix.  Each  of  the  entries  in  all  four  matrices  is  obtained  in  terms  of  the  kernel  function  k. 

Multiplying  both  sides  of  (46)  by  [4>(X)  <h(Y)]T  and  again  using  (18)  to  simplify  produces 


Kxx 

Kxy  1 

Kxx 

Kxy  1 

Nin 

Nin 

D  = 

Nin 

Nin 

Kyx 

Kyy 

Kyx 

Kyy 

N0ut 

N0ut 

N out 

N  out 

D. 


(47) 
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Solving  equation  (47)  is  tantamount  to  finding  the  eigenvectors  and  eigenvalues  of  the  kernel  matrix 


K 


REST 


Kxx 

Nin 

Kyx 

Nout 


Kxy 

Nin 

Kyy 

Nout 


=  DAD 


(48) 


and  normalizing  each  of  the  eigenvectors  by  the  square  root  of  its  associated  eigenvalue.  Here,  the  columns  of 


the  matrix  D  = 


di  d2 


*Nt 


represent  the  positive  and  negative  eigenvectors  of  the  KEST  kernel  matrix, 


K 


REST ■ 


Then,  D  = 


di  d2 

y/Xi  VX2 


where  A  =  diag  (Ai,  A2, . . . , A Nt)  .Equivalently,  they  are  the  expansion 


coefficients  in  (43).  The  corresponding  positive  and  negative  eigenvalues  are  contained  in  the  diagonal  matrix  A. 
For  simplicity,  the  eigenvalues  and  corresponding  eigenvectors  should  be  ordered  from  most  positive  significant 
to  most  negative  significant. 

Let  the  KEST  projection  vectors,  W rest  vectors  be  either  the  first  m  positive  or  negative  eigenvectors  of 
D.  Thus,  either 


W  REST  =  W  KEST 
W  REST  =  W  KEST 


=  [di  d2  •  •  •  dm] 

=  [djvt  d/vt-i  •  •  •  d^-m+i] 


(49) 


where,  as  with  KPCA,  m  is  a  configurable  constant.  The  choice  of  using  most  positive  significant  or  most 
negative  significant  is  a  data  dependent  choice  and  is  determined  using  the  procedure  outlined  for  the  linear  EST 
method  in  Section  2.3. 

Substituting  Equation  (49)  for  D  in  Equation  (44)  (i.e.  -  only  using  the  first  m  positive  or  negative  eigen¬ 
vectors)  and  using  this  result  as  the  projection  vectors,  W$,  in  Equation  (21)  yields 


KEST(r)  =  ($(r)-AYJT(v$V5)($(r)-AYJ. 

=  ($(r)  -  liyJt<S>(Z)WKEstWtkest<S>(Z)t  ($(r)  -  AyJ  • 

where  <f>(Z)  =  [4>(X)  —  4>(Y)].  For  notational  convenience,  let 


(50) 


$(r)T$(Z)  =  $(r)T  [$(xi),...,4>(xJVin),-$(y1),...,-$(yJVout)] 

=  [fc(xi,  r), . . . ,  k(xNin,  r),  -k(y1,  r), . . . ,  ~k(yNout,r)] 

=  k(Z,rf  =  Kzr  (51) 


where  the  second  equal  sign  is  as  a  direct  result  of  using  the  kernel  trick  in  Equation  (18).  The  vector  k(Z,r)T 
is  commonly  referred  to  as  the  empirical  kernel  map  of  an  input  vector  r.15  Similarly,  define 


AY**(Z) 


1 

Nout 

1 

Nout 


y  eOWR 

E  k(Z,y)T  =  K2A. 

y  eOWR 


(52) 


Using  Equations  (51)  and  (52),  Equation  (50)  becomes 

KEST(r)  =  (KL  -  K|a)  T  WkestWtkest  (k^p  -  K|a)  (53) 

As  in  the  case  of  linear  EST  in  Section  2.3,  as  well  as  in  KPCA,  it  is  also  possible  to  project  onto  the  complement 
subspace  (I  —  V$Vj)  in  the  feature  space  as  an  extension  of  Equation  (4). 
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5.  RX  AND  KERNEL-RX  ANOMALY  DETECTORS 

The  RX  anomaly  detector  introduced  by  Reed  and  Yu5  has  become  the  benchmark  for  hyperspectral  anomaly 
detection  because  of  its  natural  assumption  that  neither  the  target  spectrum  nor  the  covariance  matrix  of  the 
background  clutter  need  to  be  known.  The  RX-algorithm  is  based  on  comparing  the  difference  between  the 
test  spectrum  and  the  spectra  of  the  immediate  background  samples.  It  is  similar  to  the  Mahalanobis  distance 
measure  and  is  given  by 

RX(r)  =  (r-Ay)TCy(r-Ay).  (54) 

where  r  is  the  test  sample,  fiY  and  Cy  are  the  spectral  mean  and  covariance  of  the  background  clutter  samples 
in  the  OWR.  Similarly,  the  RX-algorithm  can  be  defined  in  the  feature  space  as 

RX($(r))  =  ($(r)  -  AyJTcC($(r)  -  A yj  (55) 

where  C y#  is  the  estimated  covariance  matrix  of  the  background  clutter  and  //y  is  the  mean  of  the  background 
clutter  samples  in  the  feature  space.  Equation  (55)  corresponds  to  a  linear  detector  in  the  feature  space;  however, 
it  corresponds  to  a  nonlinear  detector  in  the  original  input  space.  Unfortunately,  Equation  (55)  cannot  be  directly 
implemented  because  of  the  high  dimensionality  of  the  feature  space.  The  kernel  version  of  the  RX  algorithm 
was  obtained  in6  and  is  given  as 

KRX(r)  =  (Kyf  -  KIy)T K"1  (K^r  -  )  .  (56) 

where  Kyr  =  ^(r)TYc^,  K and  Ky  is  the  estimated  centered  kernel  matrix.  Equation  (56)  is  the 
kernel-RX  equation  used  in  this  paper. 


6.  RESULTS 

In  this  Section,  each  of  the  eight  equations  is  implemented  using  both  simulated  illustrative  toy  data  as  well 
as  real  hyperspectral  imagery  from  the  Hyperspectral  Digital  Imagery  Collection  Experiment  (HYDICE)  and 
Airborne  Hyperspectral  Imager  (AHI)  data  sets.  For  this  paper,  we  use  a  Gaussian  Radial  Basis  Function  (RBF) 
kernel  which  takes  the  form  &(x,  y)  =  exp[— |  |x  — y|  |2/2cr2]  where  a  >  0  is  a  critical  kernel  parameter  representing 
the  width  of  the  Gaussian  kernel.  This  parameter  must  be  chosen  so  that  the  RBF  function  can  full  exploit  the 
data  variations.  In  this  paper,  the  value  of  a  was  determined  experimentally  for  each  algorithm  and  for  each 
image  using  a  cross-validation  technique.  Performance  results  using  ROC  curve  analysis  for  each  of  the  eight 
methods  are  provided  and  compared. 

6.1  Simulated  Data 

Each  of  the  eight  algorithms  are  implemented  here  as  discrimination  methods  on  an  illustrative  toy  data  set. 
The  data  set,  shown  in  Figure  2(a),  consists  of  two  nonlinear  Gaussian  mixtures.  Class  1  is  represented  by  the 
red  (*)  points;  Class  2  is  represented  by  the  blue  (o)  points.  It  is  clear  from  this  figure  that  no  linear  separating 
hyperplane  can  be  placed  that  perfectly  separates  the  two  data  classes. 

In  order  to  implement  the  algorithms,  Class  1  and  Class  2  were  defined  as  the  two  sets  corresponding  to  the 
data  in  the  IWR  and  OWR  of  a  fictional  dual  window.  Extending  the  problem  to  an  anomaly  detection  setting, 
Class  1  represents  the  target  data  and  Class  2  represents  the  background  data.  The  results  using  each  of  the 
methods  on  the  simulated  data  set  are  shown  in  Figures  2(b)-2(i).  To  improve  visual  quality,  the  points  in  Class 
2  are  now  yellow  (o).  For  the  nonlinear  algorithms,  the  kernel  parameter  a  was  experimentally  determined  and 
set  equal  to  a  value  which  provided  a  decent  looking  result.  The  green  lines  in  Figures  2(c),  2(d),  and  2(e)  are 
the  projection  vectors  used  in  each  case.  The  blue  contour  lines  are  decision  boundaries  at  different  thresholds. 
The  shading  defines  relative  projection  separation  values;  lighter  shading  means  a  larger  projection  separation 
value  which  in  turn  implies  a  higher  likelihood  that  point  will  be  classified  as  an  anomaly.  Similarly,  darker  areas 
correspond  to  points  which  are  more  likely  to  be  classified  as  background  clutter. 

It  is  strikingly  clear  that  all  four  of  the  nonlinear  methods  have  significantly  better  discrimination  abilities 
than  their  linear  counterparts.  Each  of  the  four  nonlinear  methods  generate  decision  boundaries  which  very 
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(a) 


Figure  2.  (a)  Original  Simulated  2-D  Data  Set.  A  mixture  of  two  nonlinear  Gaussian  distributions.  The  red  points  (*) 
represent  the  data  in  Class  1  and  the  blue  (o)  represent  the  data  in  Class  2.  Contour  and  surface  plots  for  the  2-D 
simulated  data  set  using  (b)  RX,  (c)  PCA,  (d)  FLD,  (e)  EST,  (f)  KRX,  (g)  KPCA,  (h)  KFD,  and  (i)  REST. 


nicely  conform  to  the  overall  shape  of  the  distribution.  While  it  is  difficult  to  actually  compare  the  performance 
among  the  four  nonlinear  algorithms,  it  is  nonetheless  easy  to  see  that  the  nonlinear  methods  perform  better 
detection  than  the  linear  methods. 

6.2  Hyperspectral  Imagery 

Three  real  hyperspectral  images  from  two  different  HSI  sensor  databases  were  used  to  compare  the  performances 
of  the  eight  algorithms  outlined  above.  Two  of  the  images  are  from  the  Hyperspectral  Digital  Imagery  Collection 
Experiment  (HYDICE)  data  set  and  the  third  is  from  the  University  of  Hawaii’s  Airborne  Hyperspectral  Imager 
(AHI)  sensor. 

Before  any  processing  is  done,  all  spectra  in  each  image  are  normalized  so  that  all  values  in  the  data  cube  lie 
between  zero  and  one.  The  normalization  factor  is  calculated  as  the  largest  value  among  all  spectral  components 
in  each  hyperspectral  image.  This  normalization  helps  to  effectively  use  the  dynamic  range  of  the  RBF  kernel.6 
In  all  algorithms  a  dual  window  was  used  to  collect  data.  To  provide  consistency,  an  IWR  of  7x7  pixels,  a  guard 
band  of  9x9  pixels,  and  an  OWR  of  19x19  pixels  were  used  for  all  algorithms  and  for  all  images.  It  was  stated 
that  the  IWR  size  should  be  about  as  large  as  the  biggest  target  in  the  image.  This  is  more  or  less  the  case 
for  all  images.  The  size  of  the  OWR  was  chosen  such  that  there  are  a  sufficient  number  of  vectors  available  for 
further  processing. 

In  order  to  compare  the  performances  of  each  of  the  methods,  receiver  operating  characteristic  (ROC)  curves 
were  generated  based  on  ground  truth  information  obtained  from  each  image.  The  ROC  curves  provide  a  visual 
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quantitative  comparison  by  plotting  the  probability  of  correct  detection,  Pd,  versus  the  false  alarm  rate,  Rfa • 
For  each  hyperspectral  image,  ground  truth  was  obtained  by  determining  the  locations  of  all  pixels  in  the  image 
which  correspond  to  a  target  to  be  detected.  The  probability  of  detection  is  defined  as  Pd  =  and  the  false 
alarm  rate  is  calculated  by  Rfa  =  where,  at  each  threshold  T,  N^u  is  the  number  of  pixels  correctly 

identified  as  target,  Nt  is  the  total  number  of  target  pixels  in  the  ground  truth  for  that  image,  Nmiss  is  the 
number  of  pixels  incorrectly  labeled  as  targets,  and  Ntp  is  the  total  number  of  pixels  in  the  image.  For  visual 
purposes,  all  outputs  shown  below  have  been  binary  thresholded  at  a  value  which  corresponds  to  an  80%  detection 
rate  for  that  image. 

6.3  HYDICE  Imagery 

The  HYDICE  sensor  collects  radiance  information  over  a  spectral  range  spanning  the  VNIR  and  SWIR  frequency 
ranges  (0.4  -  2.5  /im).  Each  band  is  approximately  10  nm  wide  generating  a  spectral  resolution  consisting  of 
210  spectral  bands.  Due  to  water  absorption  and  low  signal-to- noise  ratio  (SNR),  only  150  of  those  bands  are 
actually  used  here;  bands  1-22,  102-108,  137-151,  and  195-210  have  been  removed.  The  two  HYDICE  images 
used  in  this  thesis  are  the  Desert  Radiance  (DR-II)  and  Forest  Radiance  (FR-I)  data  sets.  The  DR-II  image 
consists  of  6  ‘targets  of  interest’  on  a  dirt  road  running  through  a  dusty  terrain  with  light  vegetation.  The  FR-I 
image  has  14  ‘targets  of  interest’  in  a  grassy  field  situated  near  a  dense  forrest.  The  DR-II  and  FR-I  images  are 
shown  in  Figures  3(a)  and  4(a),  respectively. 

6.3.1  DR-II  Results  and  Analysis 

The  ground  truth  for  the  DR-II  image  is  shown  in  Figure  3(b).  It  clearly  shows  the  location  of  the  six  ‘targets 
of  interest’.  All  eight  algorithms  were  implemented  for  this  image  and  the  best  outputs  for  each  can  be  seen  in 
Figures  3(c)  -  3(j).  The  results  shown  are  the  best  results  obtained  using  the  eight  detectors  outlined  above. 
For  PC  A,  the  first  6  eigenvectors  using  the  OWR  spectra  were  used  in  Equation  (8).  For  KPCA,  the  first  6 
eigenvectors  using  the  OWR  spectra  were  used  in  the  complement  subspace  form  of  Equation  (28).  For  EST 
and  KEST,  the  first  3  positive  eigenvectors  were  used  in  Equations  (15)  and  (53),  respectively. 

The  ROC  curves  at  low  FAR  for  each  of  the  eight  methods  can  be  seen  in  Figure  5(a).  From  these  results, 
it  appears  that  each  of  the  four  nonlinear  methods  performs  better  than  its  respective  linear  counterpart.  In 
addition,  all  four  nonlinear  detectors  aggregately  exhibit  better  results  than  all  four  linear  detectors.  At  low 
FAR,  KPCA  in  this  situation  performs  best  among  all  methods  followed  by  KRX,  KEST  and  KFD.  Among  the 
linear  methods,  PCA,  EST,  and  RX  all  perform  about  the  same  with  FLD  clearly  performing  the  worst  out  of 
all  the  detectors. 

6.3.2  FR-I  Results  and  Analysis 

The  ground  truth  for  the  FR-I  HYDICE  image  is  shown  in  Figure  4(b).  It  clearly  shows  the  location  of  the 
fourteen  ‘targets  of  interest’.  All  eight  algorithms  were  implemented  for  this  image  and  the  best  outputs  for 
each  can  be  seen  in  Figures  4(c)  -  4(j).  The  results  shown  are  the  best  results  obtained  using  the  eight  detectors 
outlined  above.  For  PCA,  the  first  6  eigenvectors  using  the  OWR  spectra  were  used  in  Equation  (8)  and  for 
KPCA,  the  first  6  eigenvectors  using  the  OWR  spectra  were  used  in  the  complement  subspace  form  of  Equation 
(28).  For  EST,  the  first  three  negative  eigenvectors  were  used  in  Equation  (16)  and  for  KEST  the  first  3  positive 
eigenvectors  were  used  in  Equation  (53). 

The  ROC  curves  for  each  of  the  eight  algorithms  at  low  FAR  are  shown  in  Figure  5(b).  From  these  results, 
it  is  clear  that  KPCA  performs  the  best  among  all  eight  algorithms  for  this  image  since  it  detects  very  few 
background  clutter  regions.  The  results  for  KFD,  KEST,  and  PCA  also  appear  to  perform  very  well  with 
slightly  more  false  alarms  appearing  at  this  detection  rate.  At  very  low  Rfa,  PCA  outperforms  all  algorithms 
except  KPCA  and  KFD.  KEST  appears  to  perform  much  better  than  EST  for  this  image  as  EST  exhibits  a  large 
amount  of  false  alarms  around  the  treeline  region.  A  region  similar  to  this  one  could  prove  to  be  problematic  for 
anomaly  detectors  as  there  is  an  abrupt  change  from  foliage  material  to  a  shadowed  grassy  region.  Once  again, 
the  result  using  FLD  is  the  poorest  among  all  detectors.  The  results  for  this  image  indicate  that  each  of  the 
nonlinear  algorithms  performs  better  compared  with  its  respective  linear  version.  However,  since  PCA  performs 
well  for  this  image,  it  cannot  be  said  that  all  nonlinear  versions  as  a  whole  perform  this  task  better  than  the 
four  linear  methods. 
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(i)  (j) 

Figure  3.  (a)  Original  DR-II  HYDICE  image,  (b)  Ground  truth  for  the  DR-II  HYDICE  image.  Output  results  at  80% 
detection  rate  using  (c)  RX,  (d)  PCA,  (e)  FLD,  (f)  EST,  (g)  KRX,  (h)  KPCA,  (i)  KFD,  and  (j)  REST. 

6.4  AHI  Imagery  and  Results 

The  third  image  is  from  Hawaii’s  Airborne  Hyperspectral  Imagery  (AHI)  database.  This  hyperspectral  cube 
contains  70  spectral  bands  and  spans  the  long- wave  infrared  (LWIR)  frequency  range  (8  -  11.5  fi m).  Thus,  a 
spectral  resolution  of  50  nm  is  provided  by  the  sensor.  The  AHI-1  image  used  in  this  paper  is  shown  in  Figure 
6(a).  The  ground  truth  for  the  AHI-1  image  is  shown  in  Figure  6(b).  It  shows  the  locations  of  the  thirty- five 
‘targets  of  interest’  -  mines  in  this  case.  All  eight  algorithms  were  once  again  implemented  for  this  image  and  the 
best  outputs  for  each  can  be  seen  in  Figures  6(c)  -  6(j).  The  results  shown  are  again  the  best  results  obtained 
using  the  eight  detectors  outlined  above.  For  PCA,  the  first  six  eigenvectors  using  the  OWR  spectra  were  used 
in  Equation  (8)  and  for  KPCA,  the  first  six  eigenvectors  using  the  OWR  spectra  were  used  in  the  complement 
subspace  form  of  Equation  (28).  For  EST,  the  first  five  positive  eigenvectors  were  used  in  Equation  (15)  and  for 
KEST  the  first  five  positive  eigenvectors  were  used  in  Equation  (53). 

The  ROC  curves  for  each  of  the  eight  algorithms  at  low  FAR  are  shown  in  Figure  5(c).  From  the  images  and 
ROC  curves  it  can  be  seen  that  for  this  image  at  low  FAR,  KFD  performs  the  best  among  all  methods  while  RX 
clearly  really  suffers  from  a  large  number  of  false  alarms  and  thus  exhibits  the  worst  detection  performance.  In 
general,  at  low  false  alarm  rates,  KFD,  KPCA,  and  KRX  all  perform  better  than  the  other  detectors.  However, 
at  very  low  FAR  FLD  actually  achieves  a  higher  detection  rate  than  KEST.  Nonetheless,  each  nonlinear  detector 
performs  at  a  higher  level  than  its  linear  counterpart.  However,  the  performance  increase  from  each  linear 
detector  compared  to  its  corresponding  nonlinear  detector  is  not  as  significant  as  in  the  HYDICE  images.  The 
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(i)  (j) 

Figure  4.  (a)  Original  FR-I  HYDICE  image,  (b)  Ground  truth  for  the  FR-I  HYDICE  image.  Output  results  at  80% 
detection  rate  using  (c)  RX,  (d)  PCA,  (e)  FLD,  (f)  EST,  (g)  KRX,  (h)  KPCA,  (i)  KFD,  and  (j)  REST. 


(a)  (b)  (c) 

Figure  5.  ROC  curves  for  the  (a)  DR-II  (b)  FR-I  and  (c)  AHI-1  image  at  low  false  alarm  rates. 


reason  for  this  is  most  likely  explained  by  the  large  anomalous  areas  detected  on  the  left  side  of  the  images  in 
Figure  6.  This  region  corresponds  to  the  darker  regions  in  Figure  6(a).  Further  analysis  leads  to  the  conclusion 
that  the  terrain  of  these  areas  are  vastly  different  spectrally  than  the  background;  that  is,  the  spectral  properties 
of  the  dark  region  differ  greatly  from  those  of  the  background  immediately  surrounding  this  area.  This  explains 
why  these  pixels  are  labeled  as  anomalies  in  the  detector  outputs  and  why  a  large  number  of  false  alarms  are 
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(i)  (j) 

Figure  6.  (a)  Original  AHI-1  image,  (b)  Ground  truth  for  the  AHI-1  image.  Output  results  at  80%  detection  rate  using 
(c)  RX,  (d)  PCA,  (e)  FLD,  (f)  EST,  (g)  KRX,  (h)  KPCA,  (i)  KFD,  and  (j)  REST. 


generated  in  this  region.  While  they  are  in  fact  anomalies  (with  respect  to  the  background),  they  are  not 
considered  targets.  Thus,  the  nonlinear  detectors  suffer  greatly  from  false  alarms  in  this  region,  hindering  their 
overall  detection  rates.  From  Figure  6(i),  it  can  be  seen  that  KFD  does  not  generate  a  lot  of  false  alarms  in  this 
region,  helping  it  to  achieve  a  higher  detection  performance  than  all  other  detectors  for  this  image. 

7.  CONCLUSIONS 

This  paper  provided  a  performance  characterization  of  nonlinear  kernel-based  methods  for  hyperspectral  anomaly 
detection.  Four  linear  algorithms  were  used  to  generate  projection  vectors  onto  which  samples  from  the  inner 
window  region  and  outer  window  region  of  a  dual  window  centered  at  the  test  pixel  were  projected.  Each  of  these 
algorithms  was  then  mapped  into  a  high-dimensional  feature  space  in  an  attempt  to  exploit  the  higher-order 
correlation  between  the  spectral  characteristics  of  the  pixels.  The  nonlinear  algorithms  in  the  feature  space  then 
needed  to  be  rewritten  in  terms  of  kernels  functions  of  the  data  in  the  original  input  space.  All  eight  anomaly 
detection  algorithms  were  briefly  explained  and  implemented  using  three  hyperspectral  data  cubes  containing  a 
varying  number  of  ‘targets  of  interest’. 

The  results  from  the  three  hyperspectral  images  did  provide  a  rough  sense  that  the  kernel-based  algorithms 
could  achieve  better  detection  levels  than  their  respective  linear  methods.  Further  research  should  examine  the 
task  of  optimizing  the  parameters  used  in  these  algorithms  (i.e.  -  kernel  parameter,  number  of  eigenvectors  used 
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for  projection  vectors,  dual  window  size,  etc.).  In  addition,  more  hyperspectral  imagery  is  being  tested  in  order 
to  formulate  a  more  accurate  comparison  of  the  algorithms. 
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